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Abstract. In this paper, we present an automated technique swati; Synthesizing 
Wordlengths Automatically Using Testing and Induction, which uses a combi- 
nation of Nelder-Mead optimization based testing, and induction from examples 
to automatically synthesize optimal fixedpoint implementation of numerical rou- 
tines. The design of numerical software is commonly done using floating-point 
arithmetic in design-environments such as Matlab. However, these designs are 
often implemented using fixed-point arithmetic for speed and efficiency reasons 
especially in embedded systems. The fixed-point implementation reduces imple- 
mentation cost, provides better performance, and reduces power consumption. 
The conversion from floating-point designs to fixed-point code is subject to two 
opposing constraints; (i) the word- width of fixed-point types must be minimized, 
and (ii) the outputs of the fixed-point program must be accurate. In this paper, 
we propose a new solution to this problem. Our technique takes the floating-point 
program, specified accuracy and an implementation cost model and provides the 
fixed-point program with specified accuracy and optimal implementation cost. 
We demonstrate the effectiveness of our approach on a set of examples from the 
domain of automated control, robotics and digital signal processing. 



1 Introduction 

Numerical software forms a critical component of embedded systems such as robotics, 
automated control and digital signal processing. These numerical routines have two 
important characteristics. First, these routines are procedures that compute some math- 
ematical functions designed ignoring precision issues of fixed-point arithmetic. Design 
environments such as Simulink/Stateflow and Lab VIEW allow design and simulation 
of numerical routines using floating-point arithmetic that closely resembles the more 
intuitive real arithmetic. Second, the implementation of these numerical routines run in 
resource-constrained environments, requiring their optimization for low resource cost 
and high performance. It is common for embedded platforms to have processors with- 
out floating-point units due to their added cost and performance penalty. The signal 
processing/control engineer must thus redesign her floating-point program to instead 
use fixed-point arithmetic. Each floating-point variable and operation in the original 
program is simply replaced by a corresponding fixed-point variable and operation, so 
the basic structure of the program does not change. The tricky part of the redesign pro- 
cess is to find the optimal fixed-point types, viz., the optimal wordlengths (bit-widths) of 
fixed-point variables, so that the implementation on the platform is optimal — lowest 
cost and highest performance — arid the resulting fixed-point program is sufficiently 
accurate. The following novel contributions are made in this paper to address this 
problem; 



- We present a new approach for inductive synthesis of fixed-point programs from 
floating-point versions. The novelty stems in part from our use of optimization: 
we not only use optimization routines to minimize fixed-point types (bit-widths of 
fixed-point variables), as previous approaches have, but also show how to use an 
optimization oracle to systematically test the program and generate input-output 
examples for inductive synthesis. 

- We illustrate the practical effectiveness of our technique on programs drawn from 
the domains of digital signal processing and control theory. For the control theory 
examples, we not only exhibit the synthesized fixed-point programs, but also show 
that these programs, when integrated in a feedback loop with the rest of the system, 
perform as accurately as the original floating-point versions. 

2 Preliminaries 

Floating-point arithmetic |8J is a system for approximately representing real numbers 
that supports a wide range of values. It approximates a real number using a fixed num- 
ber of significant digits scaled using an exponent. The floating-point system is so called 
because the radix point can float anywhere relative to the significant digits of the num- 
ber This is in contrast to fixed-point arithmetic ||23l in which there are a fixed number 
of digits and the radix point is also fixed. Due to this feature, a floating-point repre- 
sentation can represent a much wider range of values with the same number of digits. 
The most common floating-point representation used in computers is that defined by 
the IEEE 754 Standard [l ']. The storage layout of the floating-point numbers consist of 
three basic components: the sign, the exponent, and the mantissa. The storage layout of 
the single-precision and double-precision floating point numbers is presented in Table[T] 

- The sign bit is for a positive number and 1 for a negative number Flipping the 
value of this bit flips the sign of the number 

- The mantissa, also known as the significand, represents the precision bits of the 
number. It is composed of an implicit leading bit and the fraction bits. In order to 
maximize the quantity of representable numbers, floating-point numbers are typi- 
cally stored with the radix point after the first non-zero digit. In base 2, the only 
possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and 
don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits 
of resolution, by way of 23 fraction bits in single-precision floating-point numbers, 
and 53 bits of resolution, by way of 52 fractional bits in double-precision. 

- The exponent field needs to represent both positive and negative exponents. To do 
this, a bias is added to the actual exponent in order to get the stored exponent. For 
IEEE single-precision floats, this value is 127. Thus, an exponent of means that 
127 is stored in the exponent field. A stored value of 200 indicates an exponent of 
(200 - 127), or 73. Exponents of -127 (all Os) and +128 (afl Is) are reserved for 
special numbers. For double precision, the exponent field is 11 bits, and has a bias 
of 1023. 

Floating-point solves a number of representation problems . Fixed-point has a fixed win- 
dow of representation, which limits it from representing very large or very small num- 
bers. Floating-point, on the other hand, employs a sort of "sliding window" of precision 
appropriate to the scale of the number The range of positive floating-point numbers can 
be split into normalized numbers (which preserve the full precision of the mantissa), and 





Sign 


Exponent 


Fraction 


Bias 


Single Precision 


1[31] 


8 [30 - 23] 


23 [22 - 00] 


127 


Double Precision 


1 [63] 


11 [62 - 52] 


52 [51 - 00] 


1023 



Table 1 : Floating-point Number Layout 



denormalized numbers. The denormalized numbers do not have an implicit leading bit 
of 1 and allow representation of really small numbers but with only a portion of the 
fraction's precision. The exponent of all Os (—127) and all Is (128) are reserved for 
denormalized numbers and representing infinity respectively. A complete discussion on 
the semantics of floating-point operations can be found in the IEEE 754 Standard H]. A 
floating-point unit (FPU) is used to carry out operations on floating-point numbers such 
as addition, subtraction, multiplication, division and square root. FPUs are integrated 
with CPUs in computers but most embedded processors do not have hardware support 
for floating-point operations. Emulation of floating-point operations without hardware 
support can be very slow. Inclusion of FPUs also increases the power consumption of 
the processors. This has made the use of fixed-point arithmetic very common in em- 
bedded systems. In spite of the benefits of floating-point arithmetic, embedded systems 
often use fixed-point arithmetic to reduce resource cost and improve performance. A 
fixed-point number consists of a sign mode bit, an integer part and a fractional part. We 
denote the fixed-point type of a variable x by fxT(a;). Formally, a fixed-point type is a 
triple: 

(Signedness, IWL, FWL). 

The sign mode bit Signedness is if the data is unsigned and is 1 if the data is 
signed. The length of the integer part is called the integer wordlength (IWL) and the 
length of the fractional part is called the fractional wordlength (FWL). The fixed-point 
wordlength (WL) is the sum of the integer wordlength and fractional wordlength; that 
is, WL = IWL + FWL. A fixed-point number with fractional word length (FWL) is scaled 
by a factor of 1/2™. For example, a fixed point number OHIO with as Signedness 
, integer wordlength of 3 and fractional wordlength of 2 represents 14 x 1/2^, that is, 
3.5. Converting a fixed-point number with scaling factor R to another type with scaling 
factor S, requires multiplying the underlying integer by R and dividing by S; that is, 
multiplying by the ratio R/S. For example, converting OHIO with as Signedness, 
integer wordlength of 2 and fractional wordlength of 2 into a fixed-point number with 
as Signedness, integer wordlength of 2 and fractional wordlength of 3 requires mul- 
tiplying with 2'^/2^ to obtain 011100. If the scaling factor is to be reduced, the new 
integer will have to be rounded. For example, converting the same fixed-point number 
OHIO to a fixed-point number with fractional wordlength of and integer wordlength 
of 2 yields Oil, that is, 3 which is obtained by rounding down from 3.5. The operations 
supported by fixed-point arithmetic are the same as those in the floating-point arithmetic 
standard fl] but the semantics can differ on custom hardware. For example, the round- 
ing mode for arithmetic operations could be different, and the result could be specified 
to saturate or overflow/underflow in case the wordlength of a variable is not sufficient 
to store a computed result. One complete semantics of fixed-point operation is provided 
with the Fixed-point Toolbox in Matlab 121 . The range of the fixed-point number is 
much smaller compared to the range of floating-point numbers for the same number of 
bits since the radix point is fixed and no dynamic adjustment of precision is possible. 



Translating a floating-point program into fixed-point program is non-trivial and requires 
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Single-precision Floating-point 


± 10-" *^^ to 10^^ ''^ 


Adaptive 


Fixed-point type: (1, 8, 24) 


-10^ "" to 10^ "" 




Fixed-point type: (1, 16, 16) 


-10* -'^ to lO*" ''^ 
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Table 2: Range of 32 bit fixed-point and floating-point numbers 



careful consideration of loss of precision and range. The integer wordlengths and frac- 
tional wordlengths of the fixed-point variables need to be carefully selected to ensure 
that the computation remains accurate to a specified threshold. 

3 Problem Definition 

We introduce a simple illustrative example to explain the problem of synthesizing an 
optimal fixed-point program from a floating-point program, and then present the formal 
problem definition. 



3.1 Floating-point Implementation 

Given a floating-point program, we need to synthesize fixed-point type for each 
floating-point variable. 

Example 1: The floating-point program in this example [T] takes radius as the 
input, and computes the corresponding area of the circle. Notice that the fixed-point 
program is essentially identical to the floating-point version, except that the fixed-point 
types of variables mypi, radius, t and area must be identified. Recall that the 
fixed-point type is a triple (sj, iwlj, fwlj) for j-th variable where sj denotes the 
Signedness of the variable, iwlj denotes the integer wordlength and fwlj denotes the 
fraction wordlength. We use Ffi (X) to denote the floating-point program with inputs 



Procedure 1 Floating-point program to compute circle area 
Input: radius 
Output: area 

double mypi, radius, t, area 

mypi = 3.14159265358979323846 

t = radius x radius 

area — mypi x t 

return area 



Procedure 2 Fixed-point program to compute circle area 



Input: radius, (sj , iwlj , fwl-,) for j — 1,2,3,4 
Output: area 

fx(si, iwli, fwli) mypi 

fx(s2, iwl2, fwb) radius 

fx(s3, iwls, fwls) t 

fx(s4, iwU, fwU) area 

mypi = 3.14159265358979323846 

t = radius x radius 

area = mypi x t 

return area 



X = {xi, X2, ■ ■ ■ , Xn)- Ffx{X, fxr) denotes the fixed-point version of the program, 
where the fixed-point type of a variable x ^ X is fKT{x). Note that the fixed-point 
types in Fj^iX, fxr) are defined by the mapping fxr. 



3.2 Input Domain 

The context in which a fixed-point program Ffx {X, fxr) is executed often provides a 
precondition that must be satisfied by valid inputs {xi,X2, ■ • ■ , Xn)- This defines the 
input domain denoted by Dom{X). 

Example 2: In our example of computing area of a circle, suppose that we are only 
interested in the radii in the range [0.1, 2.0). Then, the input domain _Do?n (radius) is 

radius > 0.1 A radius < 2.0 



3.3 Correctness Condition for Accuracy 

The correctness condition specifies an error function Err{Ffi{X), Ffx(X, fxr)), and 
a maximum error threshold maxError. The error function and error threshold together 
define a bound on the "distance" between outputs generated by the floating-point and 
fixed-point programs respectively. An accurate fixed-point program is one whose error 
function lies within the error threshold for all inputs in the input domain. Some common 
error functions are: 

• Absolute difference between the floating-point function and fixed-point function: 
\Ffi{X) - Ff^{X,f^T)\ 

• Relative difference between the floating-point function and fixed-point function: 

Ff,{X}~Ff^{X,f,cT) 



FjiiX) 

Moderated relative difference: 



This approaches the relative 



F;i(X)~Ff,(XS^T) 

Ffi{X)+S 

difference for Ffi{X) » 5 and approaches a weighted absolute difference for 
Ffi{X) << S. When Ffi{X) can be zero for some values of X, the moderated 
relative difference remains bounded unlike the relative difference which becomes 
unbounded. 



The correctness condition for accuracy requires that for all inputs in the provided 
input domain Dom{X), the error function Err{Ffi{X), Ffx{X,ixT)) is below the 
specified threshold maxError; i.e., 

\fX e Dom{X) . Err{Ffi{X),Ff^{X,iyiT)) < maxError 

Example 3: In our running example of computing area of a circle, the error function 
is chosen to be relative difference, the error threshold 0.01, and thus the correctness 
condition is Vradius, s.t. radius > 0.1 A radius < 2.0 

-Ff; (radius) — F/-,. (radius, fxr) 
F/j (radius) 

3.4 Implementation Cost Model 

The cost model of the fixed-point program is a function mapping fixed-point 
types to a real number For a given fixed-point program Ffx{X,{x.T), let T = 
{ti, t2, ■ ■ ■ , tk} be the set of fixed-point program variables with corresponding types 
{fxr(ti), fxT(t2)) • ■ • , fxT(ifc)}. Then the cost model (or simply cost) of Ffx is a func- 
tion 

cost : (fxT(ti), fxr(t2), ■ • ■ , fxT(tfe)) ^ K 

In practice, cost is often just a function of the total wordlengths (WL = IWL + FWL) of 
the variables. It can incorporate hardware implementation metric s such as a rea, power 
and delay. A number of cost models are available in the literature III6II7I5I6I . and all of 
these can be used in our approach. 

Example 4: The cost model proposed by Constantinides et al ||6l for the running exam- 
ple yields the following cost function. We use this cost model in all our examples. 

cost(fxr(mypi), fxr(radius), fxr(t), fxr(area)) — 

cdelay(WL(mypi)) + cmul(WL(radius), WL(radius), WL(t)) 
+cmul(WL(mypi), WL(t), WL(area)) , where 
cdelay(i) = Z + 1 and cmul(Zi, Z2, = 0.6 X {h + 1) * h - 0.85 * {h + h - I) 

The area of a multiplier grows almost linearly with both the coefficients and the 
data wordlength. The first term in the Constantinides model represents this cost. The 
second term represents the area cost of computational elements required only for carry 
propagation. The coefficients 0.6 and 0.85 were obtained through least-squared fitting 
to area of several hundred multipliers of different coefficient value and width ||6l. 



3.5 Problem Definition 

Definition 1 (Optimal Fixed-point Types Synthesis). The optimal fixed- 
point types synthesis problem is as follows. Given a floating-point program 
Ffx(X,ix.T{T)) with variables T, an input domain Dom{X), a correct- 
ness condition Err{Ffi{X), Ffx{X^ix.T{T))) < maxError, and a cost model 
cost(fxT(ii), fxr(t2), ■ • • , fxT(tfc)), the optimal fixed-point types synthesis problem 
is to discover fixed-point types 



fxr*(T) = {fxr*(ti),fxr*(t2), • ■ • ,fxr*(tfe)} 



such that the fixed-point program Ffi{X) with the above types for fixed-point variables 
satisfies the correctness condition for accuracy, that is, 

(a) VX e Dom{X) . Err{Ffi{X), Ff^{X,txT* (T))) < maxError 

and has minimal cost with respect to the given cost function among all fixed-point types 
that satisfy condition (a), that is, 

(b) fxT* = argmin cost(fxT(T)) 

fxr satisfies (a) 

Our goal is to automated this search for optimal fixed-point types. We illustrate this 
problem using the running example below. 

Example 5: In our running example of computing the area of a circle, we need 
to discover fxT*(mypi), fxr* (radius), fxr*(t) and fxT*(area) such that 

(a) the fixed-point program with the given fixed-point types satisfies the correctness 
condition; that is, Vradius, s.t., radius > 0.1 A radius < 2.0 

F/;(radius) - Fjg, (radius, fxr*) ^ q 
_F/i (radius) ~ 

(b) and the cost is minimized; that is, 

fxT* = argmin cost(fxT(mypi, radius, t, area)) 

fxr satisfies (a) 

We use this example to illustrate the trade-off between cost and error and how a human 
might use trial and error to discover the correct wordlengths. We vary the wordlength 
of the variables. The integer wordlength is selected to avoid overflow and the remaining 
bits are used for fractional wordlength. 

Case 1 (Figure[TJ: WL = 8 for all variables. fxT(mypi) = (0, 2, 6), fxT(radius) = 
(0, 1, 7), fxT(t) = (0, 2, 6), fxT(area) = (0, 4, 4). Cost is 81.80. 
Case 2 (Figure |2]i: WL = 12 for all variables, fxr(mypi) = 
(0,2, 10),fxT(radius) = (0, 1, 11), fxT(t) = (0, 2, 10), fxr(area) = (0,4, 8). Cost 
is 179.80. 

Case 3 (Figure |3]i: WL = 16 for all variables. fxT(mypi) = 
(0,2, 14),fxT(radius) = (0, 1, 15), fxT(t) = (0, 2, 14), fxT(area) = (0,4,12). 
Cost is 316.20. 

As we will show in the next section, our approach computes fixed-point types that 
meet the accuracy threshold and yield a cost of only 104.65, which, while being less 
than the cost in Case 1, satisfies the correctness criterion like Case 2. In the following 
section, we discuss our automated approach to solve this problem. 

4 Our Approach 

A central idea behind our approach, swati is to identify a small set of interesting inputs 
S{X) using testing from the input domain Dom{X) such that the optimal implementa- 
tion found using induction that satisfies the correctness condition for the inputs in S{X) 
will be optimal and correct for all inputs in the given input domain Dom{X). 
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Fig. 1 : Error for WL = 8. Error threshold at 0.01 is violated. 
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Fig. 2; WL = 12. Error threshold at 0.01 is violated. 



The top-level synthesis algorithm is presented in Procedure [3] VL„iax is an upper 
bound on wordlengths beyond which it is non-optimal to use the fixed-point version. 
The algorithm starts with a randomly selected set of examples from the given input 
domain. Then, a fixed-point implementation that satisfies the accuracy condition for 
each of these inputs and is of minimal cost is synthesized using the routine optlnduce. 
If no such implementation is found, the algorithm reports INFEASIBLE. Otherwise, the 
testing routine testErr checks whether the implementation fails the correctness con- 
dition for any input. If so, a set of inputs Bad'^ on which the implementation violates 
the correctness condition are added to the set 5* used for synthesis, and the process 
is repeated. If the correctness condition is satisfied, the resulting fixed-point types are 
output. In the rest of this section, we describe the main components of our approach in 
detail, including the theoretical result. 
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Fig. 3; WL = 16. Error threshold at 0.01 is not violated. 



4.1 Synthesizing Optimal Types for a Finite Input Set 

The opt Induce function (see Procedure|4) is used to obtain optimum fixed-point types 
such that the fixed-point program with these types satisfies the correctness condition 
for a finite input set S and has minimal cost. First, the floating-point program Fji is 
executed for all the inputs in the sample S and the range of each variable ti as well as its 
Signedness is recorded by the functions getRange and isSigned respectively. Then, 
the integer wordlength IWL sufficient to represent the computed range is assigned to 
each variable ti and the Signedness is 1 if the variable takes both positive and negative 
values, and otherwise. If the fixed-point program with maximum wordlengths \lL„iax 
fails the correctness condition, we conclude that the synthesis is not feasible and return 
_L. If not, we search for the wordlength with minimum cost satisfying the correctness 
condition using our optimization oracle Os- The result is used to compute the fractional 
wordlengths, and the resulting fixed-point types are returned. 

More precisely, Os solves the following optimization problem over fxr: 
Minimize cost(fxr) s.t. 

/y Err{Ff^{x,{xT),Ffi{x)) < maxError (1) 

Let us reflect on the nature of the above optimization problem. The overall synthesis 
algorithm might make several calls to Os for solving the optimization problem for 
different sets of inputs and hence, Os must be a fast procedure. But it is a discrete 
optimization problem with a non-convex constraint space, a problem class that is known 
to be computationally hard Q. This rules out any computationally efficient algorithm 
to implement Os without sacrificing correctness guarantees. Since the space of possible 
types grows exponentially with the number of variables, brute-force search techniques 
will not scale beyond a few variables. Satisfiability solvers can also not be directly 
exploited to search for optimal wordlengths since the existential quantification is over 
the types and not the variables. The arithmetic operators have different semantics when 
operating on operands with different types and hence, the only way to encode this search 
problem as a satisfiability problem is to case-split exhaustively on all possible types 



Procedure 3 Overall Synthesis Algorithm: swati 



Input: Floating-point program Ffp, 

Fixed-point program Fjx witii fixed-point variables T, 

Domain of inputs Dom, Error function Err, 

maximum error threshold maxError, Cost Model cost, 

maximum wordlengths VLmax 
Output: Fixed-point type fxr for variables T 
or INFEASIBLE 

S'^ = random sample from Dom, BadP — S'^ ,i — Q 

while Bad' / do 

= S'-i U Bad'-^ 
fxr* = optlnduce{Ffp, Ffx, Dom, Err, maxError , 

COStjWLmai:, S') 

if fxr ' = ± then 

return infeasible 
end if 

Bad' = testErr(F/p, Ff^, fxr', Dom, Err, maxError) 
end while 

return fxr* — fxr* 



(word-lengths), where each case encodes the fixed-point program with one possible 
type. The number of such cases is exponential in the number of the variables in the 
program under synthesis and hence, SAT problems will be themselves exponentially 
large in size. Further, one would need to invoke SAT solvers multiple times in order 
to optimize the cost function. Thus, satisfiability solving would be a wrong choice to 
address this problem. Further, the space of possible types is also not totally ordered 
and hence, binary search like techniques would also not work. For a binary search like 
technique to work, we will need to define a domination ordering over the types which 
has three properties. Firstly, it is a total ordering relation. Secondly, if a particular type 
assignment satisfies the correctness condition for all inputs then all dominating types 
satisfy the correctness condition for all inputs. Thirdly, the cost function is monotonic 
with respect to the domination ordering relation. In general, this may not be feasible for 
any given floating-point program and cost function. Hence, we implement Os using a 
greedy procedure getMinCostWL presented in Procedure |5] 

4.2 Verifying a Candidate Fixed-Point Program 

In order to verify that the fixed-point program Ffx{X,ix.T) satisfies the correctness 
condition, we need to check if the following logical formula is satisfiable. 

3X e Dom{X) Err{Ffx{X,{xT), Ffp{X)) > maxError (2) 

If the formula is unsatisfiable, there is no input on which the fixed-point program vio- 
lates the correctness condition. 

For arbitrary (possibly non-linear) floating-point and fixed-point arithmetic opera- 
tions, it is extremely difficult to solve such a problem in practice with current constraint 
solvers. Instead, we use a novel optimization-based approach to verify the candidate 
fixed-point program. The intuition behind using an optimization-based approach is that 



Procedure 4 Optimal Fixed-Point Types Synthesis: opt Induce 



Input: Floating-point program Ffp, 

Fixed-point program Fj^ with fixed-point variables T, Domain of inputs Dom, Error function 

Err, 

maximum error threshold maxError, Cost Model cost, 
max wordlengths VLm.ax, Input S 
Output: Optimal wordlengths WL for inputs or _L 
for all fixed-point variable ti in Ffx do 

IVLiU) = [log(getRange(i,,F/,,S') + 1)] 
Signedness(ti) — isSigaed{ti, Ffi, S) 
end for 

ifWL^a^ < IWLthen 

return _L 
end if 

fxr = (Signedness, IWL,WLm.ai ^ IVL) 

it Err{Ffp{x), Ffx{x,ixT)) > maxError then 

return _L 
end if 

WL — getMinCostWL(_F/p, Ffx, Dom, Srr, maxError, 

cost, WLmaa:, S" , IWL, Signedness) 
return fxr — (Signedness, IWL,WL — IWL) 



Procedure 5 getMinCostWL 

Input: Floating-point program Ffp, 

Fixed-point program Ff^ with fixed-point variables T, 
Domain of inputs Dom, Error function Err, 
maximum error threshold maxErr, Cost Model cost, 
max wordlengths VLm.ax, Input S 

Output: Optimal wordlengths WL 
valcandVL — {VLmax} 
while valcandVL is not empty do 
WL = argmin cost(ucWL) 

vc\lL(^valcand\iL 

fxr = (Signedness, IWL,WL - IWL> 

candML = 0, valcandVL — 

for all fixed-point variable ti in Ffx do 

WL'"(j) = WL(j) Vj / i, \lV~{i) = WL(i) - 1 
WL'+(j) = WL(j) Vj / i, VV+{i) = WL(i) + 1 
candVL = candVL U {WL'" , WL'+} 

end for 

for all cand in candVL do 

candfxT = (Signedness, IWL, candWL — IWL) 
it Err{Ffp{x), Ffx{x, cand)) < maxErrVx G S 
and cost(cancifxr) < cost (fxr) then 

valcandVL = valcandVL U {mnd} 
end if 
end for 
end while 
return fxr 



the error function is continuous in the inputs or with very few discontinuities 118141 . 
and hence, optimization routines can easily find inputs which maximize error function 
by starting from some random input and gradually adjusting the output to increase the 
value of the error function. The optimization oracle Oy is used to maximize the error 
function Err{Ffx{X,ix.T), Ffp{X)) over the domain Dom{X). If there is no input 
X S Dom{X) for which the error function exceeds maxError, the fixed-point pro- 
gram is correct and we terminate. Otherwise, we obtain an example input on which 
the fixed-point program violates the correctness condition. Multiple inputs can also be 
generated where they exist. 

In practice, with the current state-of-the-art optimization routines, it is difficult to 
implement Oy to find a global optimum. Instead, we use a numerical optimization 
routine based on the Nelder-Mead method [19] which can handle arbitrary non-linear 
functions and generates local optima. Procedure |6]defines testErr which invokes the 
Nelder-Mead routine (indicated by "argmaxlocal"). This routine requires one to supply 
a starting value of X, which we generate randomly. To find multiple inputs, we invoke 
the routine from from different random initial points and record all example inputs 
on which the fixed-point program violates the correctness condition. Since a global 
optimum is not guaranteed, we repeat this search maxAttempts times before declaring 
that the fixed-point program is correct. 



Procedure 6 Verification Routine testErr 

Input: Floating-point program F/p, 

Fixed-point program Ff^, Fixed-point type fxr. 

Domain of inputs Dom, Error function Err, 

maximum error threshold maxError 
Output: Inputs Bad on which Ff x violates correctness condition 

Bad = 

while i < maxAttempts do 

i = i + 1, Xq = random sample from Dom 

Xcand = eLigmsLx\ocsL\{Err{Ffp{X),Ffx{X,fTiT)),Xo) 

X 

it Err{Ffp{Xcand), Ffx{Xcand, fxr)) > maxError and X G Dom then 

Bad = Bad U {X} 
end if 
end while 



4.3 Illustration on Running Example 

The initial sample 5° is of size 10. maxAttempts in the verification routine was 
also set to 10. The number of Our algorithm took 4 iterations. We record the in- 
termediate implementations produced by synthMinCost in each of the last 3 iter- 
ations taken by our algorithm. The initial selected random sample is of size 10 and 
the MAXATTEMPTS was set to 10. The number of samples used in each subse- 
quent step of iteration after adding examples discovered by testErr procedure was 
18, 22 and 34. Figure 1415 1 and |6] illustrates the error using the intermediate imple- 
mentations and the final implementation produced by our approach. The wordlengths 
in Figure m are ■mypi{2,3), radius{l,8), i(2, 11), area(4, 11), those in Figure |5] 
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Fig. 5: Iteration 3 



are mypi(2, 5), radius{l,8), t(2, 11), area(4, 12) and in Figure |6] are mypi{2,3), 
radius{l,9), t{2, 11), area(4, 10). We observe that the number of inputs violating 
correctness constraint reduces after each iteration. This illustrate how our algorithm 
works by identifying a few representative inputs violating correctness condition in each 
iteration and adding that to the set of examples for which we synthesize the least cost 
implementation. 



4.4 Theoretical Results 

The following theorem summarizes the correctness and optimaUty guarantees of our 
approach. 

Theorem 1. The synthesis procedure presented in Procedure\3\is guaranteed to syn- 
thesize the fixed-point program which is of minimal cost and satisfies the correctness 
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condition for accuracy if optimization oracles Os and Oy find globally-optimal solu- 
tions (when they exist). 

Proof. We first prove correctness and then optimality of the obtained solution. Consider 
Equation in 

3X e Dom{X) Err{Ff.^{X,ix.T),Ffp{X)) > maxError 

If Ov finds globally-optimal solutions, it will find the X e Dom{X) that maximizes 
Err{Ffx{X, fxr), Ffp{X)), and hence determine where or not Equation |2] is satisfi- 
able. Thus, the correctness condition for accuracy is satisfied. 

Next, let fxr* {y^ ±) be the fixed-point type returned by Procedure|3] Let us assume 
that there exists a fixed-point type fxr' with a cost lower than fxr* which also satisfies 
the correctness condition: 

\/X e Dom{X) Err{Ffa;{X,ixT'),Ffp{X)) < maxError 
A cost(fxT') < cost(fxr*) 

Hence, for any D C Dom{X), 

VX e D Err{Ff^{X,iyiT'),Ff.p{X)) < maxError 
A cost(fxT') < COSt(fXT*) 

But fxT* is the solution generated by applying Os to the optimization problem of 
Equation [1] 

Minimize cost(fxT) s.t. 
/\ Err{Ff^{x,fxT),Ffi{x)) < maxError (3) 

Since Os is guaranteed to generate globally-optimal solutions, setting D = S, we 
obtain a contradiction. Hence, there exists no fixed-point type fxr' with a cost lower 
than fxr* which also satisfies the correctness condition. Hence, fxr* is the optimal 
correct solution. 



As noted earlier, it is difficult to implement ideal Os and Oy (that find global op- 
tima) with current SAT and optimization methods for arbitrary floating-point programs. 
Nonetheless, our experience with heuristic methods that find local optima has been very 
good. Also, improvements in optimization/SAT methods can directly be leveraged with 
our inductive synthesis approach. In contrast, the current techniques for synthesizing 
fixed-point versions of floating-point programs perform heuristic optimization over a 
randomly selected set of inputs (see Sec.|6]for a detailed discussion). Such techniques 
do not provide any correctness guarantees and the number of inputs needed could be 
much larger Our approach systematically discovers a small number of example inputs 
such that the optimal fixed-point program for this set yields that for the entire input 
domain. 



5 Experiments 

Apart from the running example, we present case studies from DSP and control 
systems to illustrate the utility of the presented synthesis approach. Our technique 
was implemented in Matlab, and Nelder-Mead implementation available in Matlab as 
f minsearch function was used for numerical optimization. We use the Constantinides 
et al 161 cost model. 



5.1 Running Example 

We illustrate the synthesis approach (more details in ifTSI ) presented in Section 2] us- 
ing the running example. Our algorithm used 34 examples and needed 4 iterations. To 
evaluate our approach, we exhaustively simulated the generated fixed-point program 
on the given domain (0.1 < radius < 2) at intervals of 0.0001. The is presented in 
Figure |5?T] 
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Fig. 7: Our Approach on Running Example. 



As a point of comparison, we also show the result of synthesizing a fixed-point pro- 
gram using the optlnduce routine with 100 inputs (3 times as many as our approach) 
selected uniformly at random (Figure ISTt . The horizontal line in the plots denotes the 
maximum error threshold of 0.01 on the relative difference error function. The cost of 
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Fig. 8: Running Example Using Random Inputs. 



the fixed-point program synthesized with random sampHng is 89.65, and the fixed-point 
types of the variables are fxr(inypi) = (0,2,3), fxr(radius) = (0, 1,8), fxT(t) = 
(0, 2, 10) and fxr(area) = (0, 4, 8). Notice, however, that it is incorrect for a large 
number of inputs. In contrast, the cost of the implementation produced using our tech- 
nique is 104.65, and the fixed-point types of the variables are fxr(mypi) — (0, 2, 3), 
fxr(radius) = (0, 1, 9), fxT(t) = (0, 2, 11) and fxr(area) = (0, 4, 10). 

5.2 Infinite Impulse Response (IIR) Filter 

The first case study is an IIR filter which is used in digital signal processing applications. 
It is a first-order direct form-II IIR filter with the schematic shown in Figure|9] The con- 
stants are ai = —0.5, bO = 0.9 and bl = 0.9. The fixed-point variables are identified in 
the schematic. We use our synthesis technique to discover the appropriate fixed-point 
types of these variables. The input domain used in synthesis is —2 < input < 2. 
The correctness condition for accuracy is to ensure that the relative error between the 
floating-point and fixed-point program is less than 0.1. 
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Fig. 9: IIR Filter Schematic 




Fig. 10: IIR Filter Using Floating-point and Fixed-point. In the top plot, the floating-point 
and fixed-point outputs are virtually superimposed on each other. 



In order to test the correctness of our implementation, we feed a common input 
signal to both the IIR filter implementations: floating-point version and the fixed-point 
version obtained by our synthesis technique. The input signal is a linear chirp from to 

Hz in 1 second. 

Fs 

input = (1 — 2^"^^) X sin{n x — x i^) 

where Fs = 256 and t = to 1 — and is sampled at intervals of . Figure [TO] 
shows the input, outputs of both implementations and the relative error between the two 
outputs. We observe that the implementation satisfies the correctness condition and the 
relative error remains below 0.1 throughout the simulation. 



5.3 Finite Impulse Response (FIR) Filter 

The second case study is a low pass FIR filter of order 4 with tap coefficients 
0.0346,0.2405,0.4499,0.2405 and 0.0346. The input domain, correctness condition 
and input signal to test the floating-point implementation and synthesized fixed-point 
program are same as the previous case study. Figure [iTIshows the input, outputs of both 
implementations and the relative error between the two outputs. We observe that the 
implementation satisfies the correctness condition and the relative error remains below 
0.1 throughout the simulation. 



5.4 Field Controlled DC Motor 



The next case study is a field controlled DC Motor It is a classic non-linear control 
example from Khalil lfT4l . The example used here is an adaptation of Khalil's example 
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Fig. 11: FIR Filter Using Floating-point and Fixed-point. The floating-point/fixed-point 
outputs are virtually superimposed on each other 



presented in [?]. The system dynamics is described by the following ordinary differen- 
tial equations. 

= Rfif + Lfif 

Va = CiifUJ + Laia + RJa 
JUJ = 021 fia - C3W 

The first equation is for the field circuit with Vf,if,Rf,Lf being its voltage, cur- 
rent, resistance, and inductance. The variables Va, ia , Ra , La are the corresponding volt- 
age, current, resistance, and inductance of the armature circuit described by the second 
equation. The third equation is a torque equation for the shaft, with J as the rotor in- 
ertia and C3 as a damping coefficient. The term ciifcu is the back electromotive force 
induced in the armature circuit, and fia is the torque produced by the interaction of 
the armature current with the field circuit flux. In the field controlled DC motor, field 
voltage V/ is the control input while Va is held constant. The purpose of the control is 
to drive the system to the desired set point for the angular velocity to. 

We can now rewrite the system dynamics in the following normal form where a = 

if = ~o-if + u 

ia = -bia + p - cifUJ 

UJ ~ Oifia — dw 

We assume no damping, that is, C3 = and set all the other constants a, b, c, 6, p to 
1 [?]. The state feedback law to control the system is given by 

9{a + b)ifia + Opif — cOi'iui 
= W- 



where e = 0.01 is added to denominator to avoid division by 0. ia ap- 
proaches at equiUbrium. The corresponding floating-point code is shown be- 
low. 

Input: if , ia, oj, p, c, e 
Output: u 

tl = e X ia; t2 = e + tl; t3 = l/t2; t31 = a + b; 
t32 = if X ia; t33 = t31 X t32; t4 = X t33; t41 = p X if ; 
t5 = 9 X t41; t6 = if X if; t61 = t6 x lj; t62 = t61 x 6; 
t7 = c X t62; t8 = t4 + 15; t9 = t8 - t7; u = t3 X t9; 
return u 

The system is initialized with field current if ^ 1, armature current ia = 1 and angular 
velocity a; = 1. 




2 4 6 8 10 

Time 



Fig. 12: DC Motor Using Floating-point and Fixed-point Controller with zoomed-in 
view for 2 to 3 seconds. 

The computed control law can be mathematically shown to be correct by designers 
who are more comfortable in reasoning with real arithmetic but not with finite precision 
arithmetic. Its implementation using floating-point computation also closely mimics the 
arithmetic in reals but the control algorithms are often implemented using fixed-point 
computation on embedded platforms. We use our synthesis technique to automatically 
derive a low cost fixed-point implementation of the control law computing u. The input 
domain is < ia,if,uj < 1.5. The correctness condition for accuracy is that the abso- 
lute difference between the u computed by fixed-point program and the floating-point 
program is less than 0.1. 

Figure [T2] shows the simulation of the system using the fixed-point implementa- 
tion of the controller and the floating-point implementation. This end-to-end simulation 
shows that fixed-point program generated by our technique can be used to control the 
system as effectively as the floating-point program. This illustrates the practical utility 
of our technique. Figure [T3]plots the difference between the control input computed by 
the fixed-point program and the floating-point program. It shows that the fixed-point 
types synthesized using our approach satisfy the correctness condition, and the differ- 
ence between the control input computed by the fixed-point and floating-point program 



is within the specified maximum error threshold of 0.1. The number of inputs needed 
in our approach was 127. In contrast, the fixed-point types found using 635(5X our ap- 
proach) randomly selected inputs violate the correctness condition for a large number 
of inputs. 
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Fig. 13: Error in Control Input Using Fixed-point and Floating-point Program 



5.5 Two- Wheeled Welding Mobile Robot 

The next case study is a nonlinear controller for a two-wheeled welding mobile robot 
(WMR) The robot consists of two wheels and a robotic arm. The wheels can roll 
and there is no slipping, (x, y) represents the Cartesian coordinate of the WMR's center 
point and cj) is the heading angle of the WMR. v and uj are the straight and angular 
velocities of the WMR at its center point. The welding point coordinates {xw,yw) and 
the heading angle 0^, can be calculated from the WMR's center point: 

X u, ^ X — I sin (j) 
Vw + lcos4> 

4>w = 4> 

So, the equation of motion for the welding point is as follows: 

Xw = V COS (p — luj COS (j) — / sin 
ijw = V sin (p — luj sin (p + I cos 4> 

(pw =UJ 

The objective of the WMR controller is to ensure that the robot tracks a reference 
point R. The reference point R moving with a constant velocity of Vr on the reference 
path has coordinates {xr,yr) and the heading angle (pr ■ The tracking error is the differ- 
ence between the location of the robot and the reference point. 
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The two control parameters in the model are v and uj. In order to ensure that the 
error quickly converges to 0, a nonlinear controller based on Lyapunov stability is as 
follows: 



V = l{LJr + k2e2Vr + sin 63) + Vr cos 63 + fclBi 
UJ ^ ujr + k2e2Vr + k^ sin 63 

where ki,k2 and k^ are positive constants. Table |3] provides the numerical values of 
constants and initial values of the state variables from Bui et al [3 |. All lengths are in 
meters, angle in radians and time in seconds. 



Table 3: Numerical and Constant Values 
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Fig. 14: Reference Welding Line 



We use our synthesis technique to automatically synthesize fixed-point program 
computing both control inputs: v and uj. The error function used for v is the relative 
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Fig. 15: Distance of WMR from Reference Line with zoomed-in view for 80 to 120 
seconds. 
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Fig. 16; Error in computing v 
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and the error function used for w is the moderated relative difference 
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where 5 ~ 0.001. The moderated relative difference is useful here since uj floating-point 
can be 0. We require that the difference values for both controllers are less than 0.1. 
Figure [T4l shows the reference line for welding and Figure [15] shows the distance of 
the WMR from the reference line as a function of time for both cases: firstly, when the 
controller is implemented as a floating-point program and secondly, when the controller 
is implemented as a fixed-point program synthesized using our technique. The robot 
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Fig. 17: Error in computing cu 
Table 4: Performance 



Case-study 


Runtime (seconds) 


# of Iterations 


IIR Filter 


268 


5 


FIR Filter 


379 


4 


DC Motor u 


4436 


8 


WMRt; 


2218 


7 




1720 


4 



starts a little away from the reference line but quickly starts tracking the line in both 
cases. Figure [16] and Figure [T7]show the error between the floating-point controller and 
fixed-point controller for both control inputs: v and uj, respectively. 

TablelHsummarizes the performance of our technique in the four case-studies. 



6 Related Work 

Previous techniques for optimizing fixed-point types are based on statistical sampling 
of the input space. These methods sample a large number of inputs and heuristically 
solve an optimization problem that minimizes implementation cost while ensuring that 
some correctness specification is met over the sampled inputs. The techniques differ 
in in the heuristic search method employed, in the measure of cost, or in how accu- 
racy of fixed-point implementation is determined. Sung and Kum use a heuristic 
search technique which starts with the minimum wordlength implementation as the ini- 
tial guess. The wordlengths are increased one by one till the error falls below an accept- 
able threshold. Han et al. IIOII II use a gradient-based sequential search method which 
starts with the minimum wordlength implementation as the initial guess. The gradient 
(ratio of increase in accuracy and increase in wordlengths) is computed for a set of 
wordlength changes at each step and the search moves in the direction with maximum 
gradient. Shi et al. [20j propose a floating-point to fixed-point conversion methodology 
for digital VLSI signal processing systems. Their approach is based on a perturbation 
theory which shows that the change to the first order is a Hnear combination of all 



the first- and second-order statistics of the quantization noise sources. Their technique 
works with general specification critera, as long as these can be represented as large en- 
semble averages of functions of the signal outputs. For example, they use mean-squared 
error (MSB) as the specification function. The cost of the implementation is a quadratic 
function. Monte Carlo simulation of a large number of input examples is used to for- 
mulate a quadratic optimization problem based on perturbation theory. In contrast, our 
specification requires that the accuracy condition holds for all inputs and not just on an 
average. Further, the cost function can be any arbitrary function for our technique and 
need not be quadratic. Perhaps most importantly, our technique does not rely on apriori 
random sampling of a large number of input values, instead using optimization to dis- 
cover a small set of interesting examples which suffice to discover optimal fixed-point 
implementation. 

Purely analytical methods 1211151 based on dataflow analysis have also been pro- 
posed for synthesizing fixed-point programs based on forward and backward propaga- 
tion in the program's dataflow graph. The advantages of these techniques are that they 
do not rely on picking the right inputs for simulation, can handle arbitrary programs 
(with approximation), and can provide correctness guarantees. However, they tend to 
produce very conservative wordlength results. 

Inductive synthesis based on satisfiability solving has been previously used for syn- 
thesizing programs from functional specifications. These approaches 112191 rely on con- 
straint solving in much the same way as we rely on optimization routines. However, 
these approaches only seek to find a correct program, without any notion of cost and 
optimization. In contrast, our technique is used to find a fixed-point program which is 
not only correct with respect to a condition on accuracy but is also of minimal cost. 

7 Conclusion 

In this paper, we presented a novel approach to automated synthesis of fixed-point pro- 
gram from floating-point program by discovering the fixed-point types of the variables. 
The program is synthesized to satisfy the provided correctness condition for accuracy 
and to have optimal cost with respect to the provided cost model. 
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